```python
# load the dataset
from datasets import load_dataset

dataset = load_dataset('swag', 'regular')
```
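Before sampling anything, a quick look at what we just loaded (purely an inspection step):

```python
# the 'regular' config loads as a DatasetDict; check the available splits
print(dataset)
print(dataset['train'].num_rows)
```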
Introduction:

- In this notebook I will try to investigate the SWAG dataset.
- The idea is to understand how to deal with multiple choice datasets and how to prepare them for the next step.
- Multiple choice is a frequent problem in the field of LLMs and NLP in general.
- So the preprocessing of the data will have a huge effect on the success of any proposed solution.
```python
# let's grab a sample
dataset['train'][0]
```
```
{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3416',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0}
```
- These fields represent the idea behind this dataset: a situation where we have to predict the right ending.
- `sent1` and `sent2` represent the given situation, and together they add up to `startphrase` (verified with a quick check below).
- `ending0` to `ending3` represent the candidate endings for that situation; only one is right.
- `label` indexes the right answer.
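Since `startphrase` should just be `sent1` and `sent2` joined together, here is a quick sanity check (my own addition; it assumes a single-space separator, which matches the sample above):

```python
# startphrase == sent1 + ' ' + sent2 for the sample we just printed
sample = dataset['train'][0]
assert sample['startphrase'] == sample['sent1'] + ' ' + sample['sent2']
```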
- Now let’s initialize BERT and load its tokenizer.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
```
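Before applying it to SWAG, it is worth recalling how a BERT tokenizer handles a sentence pair; a minimal illustration using strings from the sample above:

```python
# a pair is encoded as one sequence: [CLS] sent_a [SEP] sent_b [SEP];
# token_type_ids marks which segment each token belongs to
enc = tokenizer('A drum line', 'passes by walking down the street playing their instruments.')
print(tokenizer.decode(enc['input_ids']))
print(enc['token_type_ids'])  # 0s for the first segment, 1s for the second
```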
- The idea here is to tokenize a start sentence with each one of the 4 choices:
= ["ending0", "ending1", "ending2", "ending3"]
ending_names
def preprocess_function(examples):
= [[context] * 4 for context in examples["sent1"]]
first_sentences = examples["sent2"]
question_headers = [
second_sentences f"{header} {examples[end][i]}" for end in ending_names]
[for i, header in enumerate(question_headers)
]
= sum(first_sentences, [])
first_sentences = sum(second_sentences, [])
second_sentences
= tokenizer(first_sentences, second_sentences, truncation=True)
tokenized_examples return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
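To run this over the full dataset we would hand the function to `map()` in batched mode, since it expects a batch (a dict of lists) rather than a single example; a sketch of the call:

```python
# batched=True feeds the function batches of examples, as it assumes
tokenized_swag = dataset.map(preprocess_function, batched=True)
```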
To understand each operation of that function we will do it step-by-step:

- First let’s create a sub-set of the training set.
- Create an `endings` list that we will use later.
= ["ending0", "ending1", "ending2", "ending3"]
endings = dataset['train']
train_ds = train_ds[:20] smp
- Multiply each `sent1` by 4 and stack them all in a list:
```python
sent_1 = [[sent] * 4 for sent in smp['sent1']]
```
- Let’s retrieve the length of that list and see what’s inside one element of it.
```python
sent_1[2], len(sent_1)
```
```
(['A group of members in green uniforms walks waving flags.',
  'A group of members in green uniforms walks waving flags.',
  'A group of members in green uniforms walks waving flags.',
  'A group of members in green uniforms walks waving flags.'],
 20)
```
- So basically we have 4 copies of each first sentence of the dataset.
- Now we will create a list of the second sentences, i.e. the headers.
```python
headers = smp['sent2']
```
- At this point we have:
  - `sent_1`, in which each element is repeated 4 times
  - `headers`, which complete `sent_1`
- The idea here is to create pairs of each `header` + ending, giving `sent_2` for each `sent_1`.
```python
sent_2 = [[f'{head} {smp[end][i]}' for end in endings] for i, head in enumerate(headers)]
```
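Each element of `sent_2` now holds the 4 candidate continuations of the corresponding sample; a shape check:

```python
# 20 samples, each paired with its 4 header+ending strings
len(sent_2), len(sent_2[2])  # -> (20, 4)
```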
- Now we need to flatten the pairs of sentences so we can tokenize them:
```python
frst_sent = sum(sent_1, [])
scnd_sent = sum(sent_2, [])
tok_smp = tokenizer(frst_sent, scnd_sent, truncation=True)
```
- We tokenize the paired lists of sentences, which returns a dictionary with 3 keys:
```python
tok_smp.keys()
```

```
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
```
- But since we already flattened the pairs before the tokenization step, we need to unflatten them again so we can pass them through the `map()` function in order to be computed by the model.
```python
outputs = {k: [v[i: i + 4] for i in range(0, len(v), 4)] for k, v in tok_smp.items()}
```
- Let’s check if we got the unflattening step right; we just need to make sure that the `input_ids` of the first sample have the same values in both `tok_smp` and `outputs`:
```python
flatten_smp = tok_smp['input_ids']
unflatten_smp = outputs['input_ids']
```
```python
flatten_smp[0:4] == unflatten_smp[0]
```

```
True
```
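As a final check, we can decode one tokenized pair back into text to see exactly what BERT will receive (the `[CLS]`/`[SEP]` layout is standard for BERT pairs; the exact string depends on the sample):

```python
# decode the first ending-pair of the first sample:
# [CLS] first sentence [SEP] header + ending [SEP] (lowercased by the uncased tokenizer)
print(tokenizer.decode(outputs['input_ids'][0][0]))
```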